Identification of network modules by optimization of ratio association 
We introduce a novel method for identifying the modular structures of a network based on the
maximization of an objective function: the ratio association. This cost function arises when the
community detection problem is described in the probabilistic autoencoder frame. An analogy
with kernel k-means methods allows us to develop an efficient optimization algorithm, based on the
deterministic annealing scheme. The performance of the proposed method is shown on a real data
set and on simulated networks.
The structure of a complex network may be described
by identifying the modules of which it is composed.
The concept of a module is qualitative (nodes are more
connected within their modules than with the rest of
the network), and its quantification is still a subject
of debate. Modularity,
a quantity related to the correlation between the 
probability of having an edge joining two sites and 
the fact that the sites belong to the same modules, 
has been widely accepted as a measure for module 
identification. Here we provide a new description 
of this important problem. We analyze the use 
of a novel objective function, the ratio associa- 
tion, measuring the coherency between modules. 
Ratio association emerges in the probabilistic au- 
toencoder frame, performing a lossy compression 
of the network's structures. An analogy to kernel 
k-means allows the development of an efficient al- 
gorithm for the optimization of ratio association. 
The power of the proposed technique is assessed
by showing the structures found by ratio association
optimization on a real data set and on simulated networks.
The likelihood of the probabilistic autoencoder 
model may be used to select the optimal number 
of modules. 



INTRODUCTION 

A hierarchical structure of modules characterizes the
topology of most real-world networked systems [1]. In
social networks, for instance, these modules are densely
connected groups of individuals belonging to social
communities. Modules (also called community structures) are
defined as tightly connected subgraphs of a network, i.e.
subsets of nodes within which the density of links is very
high, while connections between them are much sparser.
These tight-knit modules constitute units that separately 
(and in parallel) contribute to the collective functioning 
of the network. For instance, the presence of subgroups
in biological and technological networks is at the basis of
their functioning. Hence the issue of detecting and
characterizing module structures in networks has received a
considerable amount of attention.

Rigorously, the identification of the hierarchy of modules of a
network is equivalent to the graph partitioning problem in computer
science, which is known to be an NP-complete problem [2]. A series of
efficient heuristic methods has been proposed over the years to cope
with this problem. These include methods based on spectral analysis
[3], or hierarchical clustering methods developed in the context of
social network analysis. Among the different techniques, we recall
the modularity identification based on the statistical properties of
a system of spins and hierarchical clustering techniques exploiting
the central concept of modularity [6, 7]. The modularity Q is a
measure of the correlation between the probability of having an edge
joining two sites and the fact that the sites belong to the same
module (see Ref. [7] for the mathematical definition of Q). Methods
directly based on the optimization of Q have been proposed [8, 9],
while recently a spectral technique has been introduced [10]
exploiting the information of the modularity matrix, which, for a
given graph, has the property of being dense even when the adjacency
matrix is sparse.

Furthermore, another recent stream of research has
been initiated by the relevant observation that topological
hierarchies are associated with dynamical time scales in
the transient of a synchronization process [11]. Such an
observation inspired the introduction of a fast technique 
able to detect and identify the modules of a complex 
network from the cluster de-synchronization scenario of 
phase oscillators [12]. 

In this paper, we introduce a new technique for module
identification that efficiently optimizes an objective function
called ratio association. Precisely, once the number of modules
$n_c$ is fixed, the optimization process leads to a fast detection
of the corresponding modules, in a time scaling linearly with the
number of nodes N in the network. The efficiency of the algorithm is
due to an equivalence with kernel k-means [13], which we exploit in
the deterministic annealing frame. We show that the optimization of
ratio association may be motivated in the probabilistic autoencoder
frame, a paradigm which has been used to derive cost functions for
data clustering [15]; the same cost function has been used in [16]
for classification of time series data. In order to select the
number of modules $n_c$, the quality of the solution is to be
assessed. This can be achieved in two ways: on the basis of the
modularity of the solution, or according to the likelihood of the
probabilistic autoencoder model.

FIG. 1: The Zachary karate club network. The two modules identified
by the proposed algorithm are colored in grey and white,
respectively. Squares and circles indicate the two real communities
described by Zachary [21]. Notice that our technique fully reveals
the true subdivision.

The paper is organized as follows. In the next Section 
we describe our method, while in Section 3 some applica- 
tions are shown. Section 4 summarizes our conclusions. 



THE METHOD 



Given a set of data vectors $\{x_i\}_{i=1}^{N}$, with
$x_i \in \mathbb{R}^n$, the goal of kernel k-means is to find a q-way
disjoint partition [17] $\{\pi_c\}_{c=1}^{q}$ of the data (where
$\pi_c$ represents the c-th cluster) such that the following
objective function is minimized:

$$\mathcal{D}(\{\pi_c\}_{c=1}^{q}) = \sum_{c=1}^{q} \sum_{x_i \in \pi_c} \|\phi(x_i) - m_c\|^2, \tag{1}$$

where

$$m_c = \frac{1}{|\pi_c|} \sum_{x_i \in \pi_c} \phi(x_i). \tag{2}$$

Here, $|\pi_c|$ is the cardinality of the subset $\pi_c$, and $\phi$
is a function mapping the x vectors onto a (generally)
higher-dimensional space (if $\phi$ is the identity function, the
above equations recover the standard definition of k-means).



FIG. 2: The objective function $\mathcal{D}$ for the karate club
modular structure found by the proposed algorithm, vs. the number of
communities $n_c$. Here $a = 2$, but results are stable against
variations of $a$.



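Expanding the distance term $\|\phi(x_i) - m_c\|^2$ in the
objective function, one obtains

$$\phi(x_i)\cdot\phi(x_i) - \frac{2\sum_{x_j \in \pi_c} \phi(x_i)\cdot\phi(x_j)}{|\pi_c|} + \frac{\sum_{x_j, x_l \in \pi_c} \phi(x_j)\cdot\phi(x_l)}{|\pi_c|^2}. \tag{3}$$

Notice that in Eq. (3), all computations involving data points are in
the form of inner products. As a result, one can use the kernel
trick: if one can compute the dot product
$K_{ij} = \phi(x_i)\cdot\phi(x_j)$ efficiently, then one is able to
compute distances between points in this mapped space without having
to explicitly know the mapping of $x_i$ and $x_j$ onto $\phi(x_i)$
and $\phi(x_j)$, respectively. It is known that any positive
semi-definite matrix K can be thought of as a kernel matrix [3].
Using the kernel matrix, Eq. (1) can be rewritten as:

$$\mathcal{D}(\{\pi_c\}_{c=1}^{q}) = \sum_{c=1}^{q} \sum_{x_i \in \pi_c} \left( K_{ii} - \frac{2\sum_{x_j \in \pi_c} K_{ij}}{|\pi_c|} + \frac{\sum_{x_j, x_l \in \pi_c} K_{jl}}{|\pi_c|^2} \right). \tag{4}$$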
Suppose that the graph G = (V, A) is given, where V is the set of N
vertices and A is the adjacency matrix (the elements $A_{ij}$ are one
(zero) whenever an edge is (is not) present between vertices i and
j). If $\mathcal{A}$ and $\mathcal{B}$ are two disjoint subsets of V,
we furthermore define

$$\mathrm{links}(\mathcal{A}, \mathcal{B}) = \sum_{i \in \mathcal{A},\, j \in \mathcal{B}} A_{ij}.$$

The idea is to fix the number of modules $n_c$ into which we want to
efficiently partition the original graph, and to look for the
$n_c$-way disjoint partition of V ($\{\pi_c\}_{c=1}^{n_c}$) that
maximizes the following objective function, called ratio association
[19]:

$$\mathcal{R}(\{\pi_c\}_{c=1}^{n_c}) = \sum_{c=1}^{n_c} \frac{\mathrm{links}(\pi_c, \pi_c)}{|\pi_c|}. \tag{5}$$

FIG. 3: The modularity Q for the karate club modular structure found
by the proposed algorithm, vs. $n_c$. As in Fig. 2, we use here
$a = 2$, but results are stable against variations of $a$.

FIG. 4: The ratio association $\mathcal{R}$ for the karate club
modular structure found by the proposed algorithm, vs. $n_c$. Same
stipulations on $a$ as in the captions of Figs. 2 and 3.

Let us now associate to the given graph an N x N kernel matrix as
follows:

$$K = aI + A, \tag{6}$$

where I is the identity matrix, and $a$ is a real number chosen to be
sufficiently large so that K comes out to be positive definite. Now,
given an $n_c$-way disjoint partition $\{\pi_c\}_{c=1}^{n_c}$ of the
graph, the corresponding value of the ratio association and the
objective function of kernel k-means are related as follows:

$$\mathcal{D}(\{\pi_c\}_{c=1}^{n_c}) = (N - n_c)\,a - \mathcal{R}(\{\pi_c\}_{c=1}^{n_c}). \tag{7}$$
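The relation between the kernel k-means objective and the ratio association can be checked numerically on a small example. The sketch below is only illustrative: the toy graph (two triangles joined by one edge), the value a = 2 and all function names are assumptions made here, not part of the original method. It evaluates the kernel form of the objective and the ratio association directly from their definitions and compares the two sides of Eq. (7).

```python
def links(adj, set_a, set_b):
    """Sum of adjacency entries A_ij with i in set_a and j in set_b."""
    return sum(adj[i][j] for i in set_a for j in set_b)


def ratio_association(adj, modules):
    """Sum over modules of links(pi_c, pi_c) / |pi_c|."""
    return sum(links(adj, m, m) / len(m) for m in modules)


def kernel_matrix(adj, a):
    """K = a*I + A."""
    n = len(adj)
    return [[adj[i][j] + (a if i == j else 0.0) for j in range(n)] for i in range(n)]


def kernel_kmeans_objective(K, modules):
    """Kernel k-means distortion written with kernel entries only."""
    total = 0.0
    for m in modules:
        size = len(m)
        pair_sum = sum(K[j][l] for j in m for l in m)
        for i in m:
            row_sum = sum(K[i][j] for j in m)
            total += K[i][i] - 2.0 * row_sum / size + pair_sum / size ** 2
    return total


# Toy graph (illustrative assumption): two triangles joined by the edge 2-3.
N = 6
adj = [[0] * N for _ in range(N)]
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[i][j] = adj[j][i] = 1

a = 2.0
K = kernel_matrix(adj, a)
modules = [[0, 1, 2], [3, 4, 5]]
D = kernel_kmeans_objective(K, modules)
R = ratio_association(adj, modules)
print(D, (N - len(modules)) * a - R)
```

Both printed numbers coincide for any partition into non-empty modules, because the diagonal of K contributes Na while each module contributes a plus its own ratio-association term.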



An important point follows: $\mathcal{D}$ attains its minimum in
correspondence of the same partition providing the maximum of
$\mathcal{R}$, independently of $a$, as it was shown in Ref. [20]
when considering the standard iterations of k-means. Therefore, the
kernel k-means minimization may be straightforwardly used to find the
optimal $n_c$-way clustering of the graph, i.e. the one maximizing
the ratio association. The ratio association may be derived in the
probabilistic autoencoder frame as described in the appendix.

Hence, we can use graph clustering to discover modular structures.
As we deal here with modular structures maximizing the ratio
association, in the following we will handle the optimization problem
by deterministic annealing.



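Let $p_{ic}$ be the probability that vertex i belongs to the
c-th module. We write

$$p_{ic} = \frac{e^{-\beta \epsilon_{ic}}}{\sum_{c'=1}^{n_c} e^{-\beta \epsilon_{ic'}}}, \tag{8}$$

where, according to Eq. (4),

$$\epsilon_{ic} = K_{ii} - \frac{2\sum_{j} p_{jc} K_{ij}}{\sum_{j} p_{jc}} + \frac{\sum_{j,l} p_{jc}\, p_{lc}\, K_{jl}}{\left(\sum_{j} p_{jc}\right)^2}, \tag{9}$$

and K is given by Eq. (6).

Starting from a random configuration of $\{p\}$ and $\{\epsilon\}$,
Eqs. (8) and (9) are solved iteratively while exponentially
increasing $\beta$. At large $\beta$, the $p_{ic}$ are all zero
except for one element, which identifies the module to which vertex i
has to be assigned.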

Notice that the annealing procedure leads to a final partition of
vertices which still has a tiny dependence on the starting
configuration, hence the algorithm is to be run several times,
selecting the partition leading to the lowest value of $\mathcal{D}$.
As is typical of deterministic annealing approaches, the complexity
of the algorithm is $O(n_c N \langle z \rangle)$, where
$\langle z \rangle$ is the average number of edges per vertex. Note,
however, that in the proposed method the number of modules is to be
specified in advance. Therefore, for a full hierarchical description
of the original network (i.e. when one wants also to determine the
optimal $n_c$), the algorithm has to be run by varying $n_c$ between
its minimum (2) and its maximum (N) value, leading to an overall
complexity $O(N^3 \langle z \rangle)$, since the sum of $n_c$ over
this range grows as $N^2$.
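The annealing loop and the restart strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the membership update is taken to be the Gibbs form in the inverse temperature beta, the assignment cost is taken to be the soft (membership-weighted) version of the kernel distance, and the schedule parameters (`beta0`, `growth`, `steps`), the number of restarts and the toy graph are arbitrary choices made here. The code uses dense matrices for brevity, so it does not reproduce the sparse-graph complexity quoted in the text.

```python
import math
import random


def kernel_matrix(adj, a):
    """K = a*I + A."""
    n = len(adj)
    return [[adj[i][j] + (a if i == j else 0.0) for j in range(n)] for i in range(n)]


def soft_costs(K, p, n_c):
    """Assumed soft kernel distance eps[i][c] of vertex i from module c."""
    n = len(K)
    eps = [[0.0] * n_c for _ in range(n)]
    for c in range(n_c):
        w = [p[j][c] for j in range(n)]
        mass = max(sum(w), 1e-12)  # guard against an emptied module
        Kw = [sum(K[i][j] * w[j] for j in range(n)) for i in range(n)]
        quad = sum(w[i] * Kw[i] for i in range(n)) / mass ** 2
        for i in range(n):
            eps[i][c] = K[i][i] - 2.0 * Kw[i] / mass + quad
    return eps


def anneal(adj, n_c, a=2.0, beta0=0.5, growth=1.1, steps=80, seed=0):
    """One annealing run; returns a hard module label for every vertex."""
    rng = random.Random(seed)
    n = len(adj)
    K = kernel_matrix(adj, a)
    p = []
    for _ in range(n):  # random starting memberships, each row sums to 1
        row = [rng.random() + 1e-3 for _ in range(n_c)]
        total = sum(row)
        p.append([v / total for v in row])
    beta = beta0
    for _ in range(steps):
        eps = soft_costs(K, p, n_c)
        for i in range(n):
            low = min(eps[i])  # shift exponents for numerical stability
            w = [math.exp(-beta * (e - low)) for e in eps[i]]
            total = sum(w)
            p[i] = [v / total for v in w]  # Gibbs membership update
        beta *= growth  # exponential increase of beta
    return [max(range(n_c), key=lambda c: p[i][c]) for i in range(n)]


def distortion(K, labels, n_c):
    """Kernel k-means objective of a hard partition (empty modules skipped)."""
    total = sum(K[i][i] for i in range(len(K)))
    for c in range(n_c):
        m = [i for i, lab in enumerate(labels) if lab == c]
        if m:
            total -= sum(K[i][j] for i in m for j in m) / len(m)
    return total


def best_of_restarts(adj, n_c, a=2.0, restarts=5):
    """Repeat the annealing from different seeds; keep the lowest distortion."""
    K = kernel_matrix(adj, a)
    runs = [anneal(adj, n_c, a=a, seed=s) for s in range(restarts)]
    return min(runs, key=lambda labels: distortion(K, labels, n_c))


# Toy graph (illustrative assumption): two 4-cliques joined by the edge 3-4.
N = 8
adj = [[0] * N for _ in range(N)]
for block in (range(0, 4), range(4, 8)):
    for i in block:
        for j in block:
            if i != j:
                adj[i][j] = 1
adj[3][4] = adj[4][3] = 1
labels = best_of_restarts(adj, n_c=2)
print(labels)
```

The value a = 2 keeps K positive definite for this toy graph, so the synchronous update does not oscillate; with a different graph or schedule the number of steps and restarts may need tuning.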

The choice of $a$ deserves a few comments. As said before, to enforce
the positive definiteness of K, and thus to establish the connection
to kernel k-means, $a$ must be sufficiently large. However, since
varying $a$ does not change the global optimum, one may choose $a$ in
the most convenient way from the computational point of view, even
though K will then not be guaranteed to be positive definite.

FIG. 5: $\Delta L = L - L_0$ (see the text) is plotted versus $n_c$,
for the Zachary network.
FIG. 6: The fraction p of correctly classified nodes is plotted
versus $z_{out}$, the average number of edges a node forms with
members of other modules, for the proposed algorithm (stars), for the
Girvan and Newman method [7] (empty squares) and for the method
introduced by Duch and Arenas [9] (empty circles). Each point refers
to an ensemble average over 100 different network realizations. Here
$a = 2$, as in the preceding figure captions.



APPLICATIONS 

Let us first discuss the application of the proposed method to the
well-known Zachary karate club network [21], shown in Figure 1. We
point out that the output of the algorithm is independent of
$a \in [0, 20]$ (notice that K, in this case, is positive definite
for $a > 5$).

When selecting $n_c = 2$ (i.e. when trying to split the network into
two modules), Figure 1 shows that one fully recovers the true
subdivision of the data set, with $\mathcal{R} = 8.0139$ and
modularity Q = 0.3715.

When repeating the analysis at varying $n_c$, Figure 2 reports the
value of $\mathcal{D}$, corresponding to the solution, as a function
of $n_c$: it is a strictly decreasing function. In Figure 3 we plot
the modularity Q of the solution versus $n_c$: the maximum is
Q = 0.420 and corresponds to a partition of the graph into four
modules, in perfect agreement with the outcome of other techniques
previously tested on the Zachary karate club network. Finally,
Figure 4 reports the ratio association $\mathcal{R}$ versus $n_c$,
making evident the validity of Eq. (7).
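For readers who want to reproduce a curve of modularity against the number of modules, a partition can be scored with a few lines of code. The sketch below assumes the standard Newman and Girvan definition of Q (fraction of edges inside each module minus the squared fraction of edge ends attached to it); the text defers to Ref. [7] for the definition, so this form, the function name and the toy graph are assumptions made for illustration.

```python
def modularity(adj, labels):
    """Modularity Q of a partition, assuming the standard Newman-Girvan form."""
    n = len(adj)
    two_m = sum(sum(row) for row in adj)  # twice the number of edges
    degree = [sum(row) for row in adj]
    q = 0.0
    for c in set(labels):
        members = [i for i in range(n) if labels[i] == c]
        inside = sum(adj[i][j] for i in members for j in members)  # 2 * internal edges
        attached = sum(degree[i] for i in members)  # edge ends on the module
        q += inside / two_m - (attached / two_m) ** 2
    return q


# Toy graph (illustrative assumption): two triangles joined by the edge 2-3.
N = 6
adj = [[0] * N for _ in range(N)]
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[i][j] = adj[j][i] = 1

print(modularity(adj, [0, 0, 0, 1, 1, 1]))
```

Evaluating this function on the partition returned for each number of modules gives one point of such a curve.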

The selection of $n_c$ may also be done on the basis of the average
log-likelihood of the autoencoder (see the appendix). In Figure 5 we
plot $L - L_0$ vs. $n_c$, where L is the average log-likelihood of
the data set, whilst $L_0$ is the same quantity evaluated on a
network with the same number of nodes and links but with links
randomly assigned to pairs of nodes [22]. According to the criterion
of the largest $\Delta L = L - L_0$, both $n_c = 2$ and $n_c = 4$ are
suitable partitions.

FIG. 7: Top: $\Delta L = L - L_0$ (see the text) is plotted versus
$n_c$, for a randomly generated network with $z_{out} = 4$. Bottom:
the modularity of the modular structure found by the proposed
algorithm on the same network.


To evaluate the performance of the proposed technique, we generate a
set of random graphs featuring a well defined modular structure.
Precisely, all graphs are generated with N = 128 nodes and K = 1024
edges. The nodes are distributed into four modules, containing 32
nodes each. Pairs of nodes belonging to the same module (to different
modules) are linked with probability $p_{in}$ ($p_{out}$). $p_{out}$
is taken so that the average number $z_{out}$ of edges a node forms
with members of other communities can be controlled (in our trials
$z_{out}$ has been varied up to 10). $p_{in}$ is chosen so as to
maintain a constant total average node degree
$\langle k \rangle = 16$. Notice that, as $z_{out}$ increases, the
modular structure of the network becomes weaker and harder to
identify. As the real modular structure is here directly imposed by
the generation process, the performance of the identification method
can be assessed by monitoring the fraction p of correctly classified
nodes vs. $z_{out}$. In Figure 6 we report a comparative analysis of
p vs. $z_{out}$ for the proposed algorithm, for the Girvan and Newman
method [7] and for the method introduced by Duch and Arenas [9]. The
result is that the accuracy attained by our method comes out to be
slightly better than that of Ref. [9]. Note that while applying our
algorithm to these networks, we have selected $n_c$ by maximizing
$L - L_0$ (similar results are obtained maximizing the modularity,
see Figure 7).

We finally report in Figure 8 the CPU time needed to complete a given
partition as a function of the number of vertices. The curve confirms
that, for a given $n_c$, the computational demand scales linearly
with the network size.

FIG. 8: The scaling of the CPU time is reported as a function of the
number of vertices N, at fixed $n_c$.



CONCLUSIONS

In conclusion, we introduced a novel method for identifying the
modular structures of a network based on the maximization of an
objective function: the ratio association. This objective function
emerges in the frame of probabilistic autoencoders, thus providing a
new description of the community detection problem, in terms of a
lossy compression of the structures, as well as a new selection
strategy for the number of modules by means of the log-likelihood of
the model.



APPENDIX

In this appendix we show that ratio association may 
be derived in the Probabilistic Autoencoder Framework. 
We briefly discuss autoencoders described by one-stage 
folded Markov chains [HI]. Let us consider a point x, 
in a data space, sampled with probability distribution $P_0(x)$; a
code index $\alpha \in \{1, \ldots, q\}$ is assigned to x according
to conditional probabilities $P(\alpha|x)$. A reconstructed version
of the input, x', is then obtained by use of the Bayesian decoder:

$$P(x'|\alpha) = \frac{P(\alpha|x')\,P_0(x')}{P(\alpha)}, \tag{10}$$

with $P(\alpha) = \int dx\, P_0(x)\, P(\alpha|x)$. The joint
distribution of x, x' and $\alpha$, describing this encoding-decoding
process, is

$$P(x, x', \alpha) = P_0(x)\, P(\alpha|x)\, P(x'|\alpha); \tag{11}$$

owing to Eq. (10), the joint distribution reads:

$$P(x, x', \alpha) = \frac{P_0(x)\, P(\alpha|x)\, P(\alpha|x')\, P_0(x')}{P(\alpha)}. \tag{12}$$

The conditional probabilities $\{P(\alpha|x)\}$ are the free
parameters that must be adjusted to force the autoencoder to emulate
the identity map on the data space.

Let s(x,x') be a measure of the similarity between input and output;
the average similarity is then given by

$$S = \sum_{\alpha=1}^{q} \int dx \int dx'\, \frac{P_0(x)\, P(\alpha|x)\, P(\alpha|x')\, P_0(x')}{P(\alpha)}\, s(x, x'). \tag{13}$$

A good autoencoder is obviously characterized by a high value of S.
Given a set of data vectors $\{x_i\}_{i=1}^{N}$, partitioning these
points into q modules corresponds, in this frame, to designing an
autoencoder, with q code indexes, acting on the data space. Choosing
the encoder to be deterministic leads to:

$$P(\alpha|x) = \delta_{\alpha, \gamma(x)}, \tag{14}$$

$\gamma(x) \in \{1, \ldots, q\}$ being the code index associated to
x. The estimate for the average similarity, based on the data-set at
hand, is given by:

$$S = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{\delta_{\gamma(x_i), \gamma(x_j)}}{|\pi_{\gamma(x_i)}|}\, s(x_i, x_j). \tag{15}$$

If the similarity matrix $s_{ij} = s(x_i, x_j)$ is identified with
the kernel matrix K, we obtain:

$$N S = \sum_{\gamma=1}^{q} \frac{1}{|\pi_\gamma|} \sum_{i, j \in \pi_\gamma} K_{ij} = q\,a + \mathcal{R},$$

where the last equality uses K = aI + A. Therefore maximization of
the ratio association is equivalent to designing the most effective
autoencoder, the effectiveness being measured by the average
similarity.

Now we consider the average log-likelihood [23] of data
$\{x_i\}_{i=1}^{N}$:

$$L = \sum_{\gamma=1}^{q} \int dx\, P(x, \gamma)\, \log P(x|\gamma). \tag{16}$$

We may easily obtain an estimate of this quantity, which measures how
well the autoencoder frame models the data-set. Using kernel density
estimation [24] we obtain:

$$L = \frac{1}{N} \sum_{i=1}^{N} \log\left( \frac{\mathrm{links}(\{i\}, \pi_{\gamma(i)})}{|\pi_{\gamma(i)}|} \right); \tag{17}$$

the numerator, in the formula above, is the number of links from node
i to nodes in the same module as i, whereas the denominator is the
cardinality of the module to which i belongs.
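The likelihood estimate described above is simple to evaluate for a given partition. The following sketch assumes that the estimate is the average, over nodes, of the logarithm of the ratio just described (links from node i into its own module, divided by the size of that module); the function name, the convention of returning minus infinity when a node has no link inside its module, and the toy graph are assumptions made here for illustration.

```python
import math


def average_log_likelihood(adj, labels):
    """Average over nodes of log(internal links of node i / size of its module)."""
    n = len(adj)
    sizes = {}
    for lab in labels:
        sizes[lab] = sizes.get(lab, 0) + 1
    total = 0.0
    for i in range(n):
        inside = sum(adj[i][j] for j in range(n) if labels[j] == labels[i])
        if inside == 0:
            # Assumed convention: a node isolated within its module gives log(0).
            return float("-inf")
        total += math.log(inside / sizes[labels[i]])
    return total / n


# Toy graph (illustrative assumption): two triangles joined by the edge 2-3.
N = 6
adj = [[0] * N for _ in range(N)]
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[i][j] = adj[j][i] = 1

print(average_log_likelihood(adj, [0, 0, 0, 1, 1, 1]))
```

The selection criterion used in the applications compares this quantity with the same estimate on a randomized network with equal numbers of nodes and links; that baseline is not included in this sketch.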